Hybrid Selection of Language Model Training Data Using Linguistic Information and Perplexity
Abstract
We explore the selection of training data for language models using perplexity. We introduce three novel models that make use of linguistic information and evaluate them on three different corpora and two languages. In four out of the six scenarios, a linguistically motivated method outperforms the purely statistical state-of-the-art approach. Finally, a method which combines surface forms and the linguistically motivated methods outperforms the baseline in all the scenarios, selecting data whose perplexity is between 3.49% and 8.17% (depending on the corpus and language) lower than that of the baseline.
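As a rough illustration of what perplexity-based selection of training data means, the sketch below ranks candidate sentences by their perplexity under a model trained on in-domain text and keeps the lowest-scoring ones. It is not the method evaluated in the paper: the add-one-smoothed unigram model, the whitespace tokenization, and the names `train_unigram`, `perplexity`, and `select_by_perplexity` are all assumptions made for this example. The abstract does not specify the models at this level of detail.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Build an add-one-smoothed unigram model (an assumption for this sketch)."""
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserved for unseen tokens

    def logprob(tok):
        # Counter returns 0 for unseen tokens, so smoothing covers them.
        return math.log((counts[tok] + 1) / (total + vocab))

    return logprob


def perplexity(logprob, sentence):
    """Per-token perplexity of a whitespace-tokenized sentence."""
    toks = sentence.split()
    if not toks:
        return float("inf")  # empty sentences are ranked last
    return math.exp(-sum(logprob(t) for t in toks) / len(toks))


def select_by_perplexity(in_domain, pool, k):
    """Keep the k pool sentences with the lowest perplexity under the in-domain model."""
    logprob = train_unigram(in_domain)
    return sorted(pool, key=lambda s: perplexity(logprob, s))[:k]


if __name__ == "__main__":
    in_domain = ["the cat sat", "the dog sat"]
    pool = ["the cat sat", "quantum flux capacitor", "the dog ran"]
    print(select_by_perplexity(in_domain, pool, 2))
```

A linguistically informed variant in the spirit of the abstract could, under the same assumptions, score sequences of lemmas or tags instead of surface tokens before ranking.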
Similar resources
Neural Network Language Models for Candidate Scoring in Hybrid Multi-System Machine Translation
This paper presents the comparison of how using different neural network based language modelling tools for selecting the best candidate fragments affects the final output translation quality in a hybrid multi-system machine translation setup. Experiments were conducted by comparing perplexity and BLEU scores on common test cases using the same training data set. A 12-gram statistical language ...
Adaptive Hybrid POS Cache based Semantic Language Model
This paper presents a language model as an improvement over the stochastic language model for developing a syntactic structure based on word dependencies in local and non-local domains. The model addresses the issues of a limited amount of training material and the exploitation of the linguistic constraints of the language. The proposed model is a dynamic probabilistic model which uses word depen...
Selection Criteria for Word Trigger Pairs in Language
In this paper, we study selection criteria for the use of word trigger pairs in statistical language modeling. A word trigger pair is defined as a long-distance word pair. To select the most significant trigger pairs, we need suitable criteria which are the topics of this paper. We extend a baseline language model by a single word trigger pair and use the perplexity of this extended language mode...
Method of Selecting Training Sets to Build Compact and Efficient Statistical Language Model
For statistical language model training, target task matched corpora are required. However, training corpora sometimes include both target task matched and unmatched sentences. In such a case, training set selection is effective for both model size reduction and model performance improvement. In this paper, a training set selection method for statistical language model training is described. The ...
Joint and Coupled Bilingual Topic Model Based Sentence Representations for Language Model Adaptation
This paper is concerned with data selection for adapting language model (LM) in statistical machine translation (SMT), and aims to find the LM training sentences that are topic similar to the translation task. Although the traditional approaches have gained significant performance, they ignore the topic information and the distribution information of words when selecting similar training senten...
Publication date: 2013